add batch_bytes configuration for Flint #329

penghuo · 2024-05-02T00:24:12Z

Description

Change the default value

spark.datasource.flint.write.batch_size: 1000 --> Integer.MAX_VALUE.
spark.datasource.flint.write.refresh_policy: wait_for --> false.

Add new settings

spark.datasource.flint.write.batch_bytes: The approximate amount of data, in bytes, written to Flint in a single batch request. The actual amount written to OpenSearch may exceed this value. The default is 1mb. After each document, the writer checks whether the document count (docCount) has reached batch_size or the buffer size has surpassed batch_bytes. If either condition is met, the current batch is flushed and the document count resets to zero.
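
The flush rule described above can be sketched roughly as follows. This is a minimal illustration, not the FlintWriter/OpenSearchWriter code from this PR: the class name `BatchingWriterSketch`, its fields, and the in-memory `flushedBatchSizes` list (standing in for a bulk request to OpenSearch) are all hypothetical, and measuring buffer size as UTF-8 bytes is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the batch_size / batch_bytes flush rule; not the PR's actual code. */
class BatchingWriterSketch {
    private final int batchSize;    // plays the role of write.batch_size
    private final long batchBytes;  // plays the role of write.batch_bytes
    private final List<String> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private int docCount = 0;

    /** Records the size of each flushed batch; stands in for a real bulk request. */
    final List<Integer> flushedBatchSizes = new ArrayList<>();

    BatchingWriterSketch(int batchSize, long batchBytes) {
        this.batchSize = batchSize;
        this.batchBytes = batchBytes;
    }

    /** Buffers one document, then checks both limits. */
    void write(String doc) {
        buffer.add(doc);
        // Assumption: buffer size is measured as UTF-8 encoded bytes.
        bufferedBytes += doc.getBytes(StandardCharsets.UTF_8).length;
        docCount++;
        // The check runs after the document is buffered, so a flushed batch
        // can be larger than batchBytes.
        if (docCount >= batchSize || bufferedBytes > batchBytes) {
            flush();
        }
    }

    /** Sends the current batch (here: records its size) and resets the counters. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flushedBatchSizes.add(buffer.size());
        buffer.clear();
        bufferedBytes = 0;
        docCount = 0;
    }

    public static void main(String[] args) {
        // batch_size is effectively unlimited, so only the byte limit triggers a flush.
        BatchingWriterSketch writer = new BatchingWriterSketch(Integer.MAX_VALUE, 10);
        writer.write("aaaa");
        writer.write("bbbb");
        writer.write("cccc"); // 12 bytes buffered, limit of 10 surpassed: flush 3 docs
        writer.write("dddd");
        writer.flush();       // final flush of the remaining document
        System.out.println(writer.flushedBatchSizes); // prints [3, 1]
    }
}
```

In this sketch the first batch holds 12 bytes against a 10-byte limit, which mirrors the note that the data actually written may exceed batch_bytes.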

Test Result

With 3 EMR-S executors (4 vCPU, 16 GB memory) writing to a single-node OpenSearch cluster, throughput is 40 MB/s.

Issues Resolved

#304

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Peng Huo <[email protected]>

flint-core/src/main/scala/org/opensearch/flint/core/storage/FlintWriter.java

flint-core/src/main/scala/org/opensearch/flint/core/storage/OpenSearchWriter.java

Signed-off-by: Peng Huo <[email protected]>

dai-chen

Thanks for the changes!

penghuo added 4 commits May 1, 2024 17:21

add batch_bytes for FlintWriter

4ffb413

Signed-off-by: Peng Huo <[email protected]>

change DEFAULT_REFRESH_POLICY to false

7396f61

Signed-off-by: Peng Huo <[email protected]>

Fix IT

9b35a59

Signed-off-by: Peng Huo <[email protected]>

update doc

5861017

Signed-off-by: Peng Huo <[email protected]>

penghuo marked this pull request as ready for review May 2, 2024 17:32

penghuo requested review from dai-chen, rupal-bq, vmmusings, seankao-az, anirudha, kaituo and YANG-DB as code owners May 2, 2024 17:32

penghuo self-assigned this May 2, 2024

penghuo added the 0.4 label May 2, 2024

dai-chen reviewed May 2, 2024

flint-core/src/main/scala/org/opensearch/flint/core/storage/FlintWriter.java (resolved)

flint-core/src/main/scala/org/opensearch/flint/core/storage/OpenSearchWriter.java (outdated, resolved)

penghuo changed the title ~~add batch_bytes for FlintWriter~~ add batch_bytes configuration for Flint May 2, 2024

address comments

cbbaee3

Signed-off-by: Peng Huo <[email protected]>

dai-chen approved these changes May 3, 2024

penghuo merged commit d9c0ba8 into opensearch-project:main May 3, 2024
4 checks passed
